Building on Linear Regression

#analysis #data #math #notes #processing #python #regression #sklearn

Linear regression is a simple and widely used approach for modeling the relationship between a dependent variable and one or more independent variables. However, it assumes that this relationship is linear. When dealing with non-linear data, we can extend linear regression by incorporating non-linear transformations of the features, such as polynomial terms, and then fitting an ordinary linear model on the transformed features.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Generate sample non-linear (quadratic) data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + 2 * X**2 + np.random.randn(100, 1)

# Transform features to higher degrees
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Fit linear regression model on transformed features
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Predictions
y_pred = lin_reg.predict(X_poly)
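
As a quick sanity check, the fitted parameters can be inspected directly (a minimal sketch; the variables come from the block above):

# The model should roughly recover the intercept and the coefficients of the
# X and X**2 terms used to generate the sample data
print("Intercept:", lin_reg.intercept_)
print("Coefficients:", lin_reg.coef_)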

Handling Non-Linear Data

Non-linear data can be handled in various ways, including polynomial regression, spline regression, or non-linear models such as decision trees or neural networks. Polynomial regression fits a polynomial function to the data by adding powers of the original features as new features; the model remains linear in its coefficients, so ordinary linear regression machinery still applies.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

# Generate non-linear data
np.random.seed(0)
X = 6 * np.random.rand(100, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(100, 1)

# Transform features to higher degrees
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)

# Fit polynomial regression model
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)

# Predictions
y_pred = lin_reg.predict(X_poly)
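
To judge how well the polynomial fit captures the curvature, a quick evaluation can be added (a minimal sketch using scikit-learn's metrics; in practice you would score on held-out data rather than the training set):

from sklearn.metrics import mean_squared_error, r2_score

# Compare predictions against the observed targets
print("MSE:", mean_squared_error(y, y_pred))
print("R^2:", r2_score(y, y_pred))

# The fitted model should roughly recover y = 0.5*x**2 + x + 2
print("Intercept:", lin_reg.intercept_)
print("Coefficients:", lin_reg.coef_)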

Regularization

Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function. Two common types of regularization are Ridge Regression (L2 regularization) and Lasso Regression (L1 regularization).
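
Concretely, writing w for the coefficient vector and α for the regularization strength, the standard penalized least-squares objectives are:

Ridge (L2):  minimize  ||y - Xw||² + α · Σ w_j²
Lasso (L1):  minimize  ||y - Xw||² + α · Σ |w_j|

(scikit-learn's Ridge and Lasso scale the squared-error term slightly differently from each other, but the penalty structure is the same.)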

from sklearn.linear_model import Ridge, Lasso
import numpy as np

# Generate sample data
np.random.seed(0)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Fit Ridge regression model (alpha controls the penalty strength;
# larger alpha shrinks the coefficients more aggressively)
ridge_reg = Ridge(alpha=1)
ridge_reg.fit(X, y)

# Fit Lasso regression model (alpha plays the same role, with an L1 penalty)
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
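
The alpha values above are arbitrary. In practice the regularization strength is usually chosen by cross-validation; here is a minimal sketch using scikit-learn's RidgeCV and LassoCV (the alpha grid is just an illustration):

from sklearn.linear_model import RidgeCV, LassoCV

# Try a small grid of candidate penalties and keep the best one per model
alphas = [0.01, 0.1, 1.0, 10.0]
ridge_cv = RidgeCV(alphas=alphas).fit(X, y.ravel())
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X, y.ravel())

print("Best Ridge alpha:", ridge_cv.alpha_)
print("Best Lasso alpha:", lasso_cv.alpha_)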

Geometric Representation of Ridge and Lasso

Ridge and Lasso regression can be represented geometrically as constraints on the coefficients. In Ridge regression the constraint is an L2 norm, which forms a circular (spherical in higher dimensions) constraint region. In Lasso regression the constraint is an L1 norm, which forms a diamond-shaped constraint region; because the loss surface tends to touch this region at its corners, Lasso can drive some coefficients exactly to zero. The sketch after the bullet points below shows this difference in practice.

  • Lasso (L1): It’s like saying, “I want the blanket to fit well, but I also want it to be smooth with as few folds as possible.” Each fold represents a feature in our model, and L1 regularization tries to minimize the number of folds. In the end, you’ll have a blanket that fits the bed well but might not get into every nook and cranny, especially if those are just noise. In the geometric visual, Lasso would try to keep the plane well-fitted to the points but might flatten out in the direction of less important variables (shrinking coefficients to zero).
  • Ridge (L2): This is like wanting a snug fit but also wanting the blanket to be evenly spread out without any part being too far from the bed. So even though the blanket still fits the bed closely, it won’t get overly contorted to fit the minor bumps. In the geometric visual, Ridge adds a penalty that constrains the coefficients, shrinking them towards zero but not exactly to zero. This keeps the plane close to the points but prevents it from tilting too sharply to fit any particular points too closely, thus maintaining a bit of a distance (bias) to prevent overfitting to the noise.
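
To make the geometric picture concrete, here is a minimal sketch (the five-feature data, the true coefficients, and the alpha values are illustrative assumptions): with a few irrelevant features present, Lasso tends to push their coefficients to zero, while Ridge only shrinks them.

from sklearn.linear_model import LinearRegression, Ridge, Lasso
import numpy as np

# Data with 5 features, only two of which actually influence y
np.random.seed(0)
X = np.random.randn(100, 5)
true_coef = np.array([3.0, 0.0, 0.0, 1.5, 0.0])
y = X @ true_coef + np.random.randn(100)

# Fit plain, Ridge, and Lasso regression and compare the learned coefficients
for name, model in [("OLS", LinearRegression()),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Lasso", Lasso(alpha=0.1))]:
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))

# Expected pattern: Lasso drives the irrelevant coefficients to (or very near)
# zero, while Ridge shrinks everything a little but keeps them non-zero.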